Human beings learn and reinforce behaviour with actions that lead to rewarding outcomes. Many aspects of human behaviour can thus be categorised as search problems in this process. Right from foraging for food or resources for one’s survival to searching through a hypothesis space and even in interpersonal relationships. The horizons of possible actions and their consecutive rewards is what has been shaping the learning behaviour in humans, where we search through various permutations and combinations of probable actions to generate favourable results and this further shapes our experience through learning. However, in this process, it is vital that we exercise caution and full use of our intellect by learning to balance the dual goals of exploring the unknown options while also employing them in a way that generates favourable outcomes and immediate returns to further shape our learnt behaviour for life.
This further takes us to the exploration-exploitation dilemma using the multi-armed bandit problems. To understand this better, imagine yourself in front of a row of slot machines, where one plays to maximize their rewards by independently learning the reward distributions of each option. The way to maximize this is to understand how to effectively learn which arms of the machine are better to play (i.e., exploration) and at the same time knowing which high-value arms would maximize one’s reward (i.e., exploitation). While this is true in a gambler’s scenario in front of the slot machine, it is not sufficient in real-life scenarios to just know when to explore being under the constraints and limitations of time and/or resources; it is imperative to know where one should, thus, explore.
Our experiment is a replication study, based on the experiments by Wu et. al. (2017) on the way Generalization guides human exploration in vast decision-spaces. In the paper, the researchers introduce the spatially correlated multi-armed bandit as a paradigm to study how people use generalisation to guide their search in larger problem spaces than the traditionally used paradigm. According to the original paper, it was hypothesized that if function learning guides search behaviour, participants would perform better and learn faster in smooth environments, in which stronger spatial correlations reveal more information about nearby tiles.
We traverse on the same path of including the exploration-exploitation dilemma by increasing the reward expectations during uncertainty for both the payoff conditions of accumulation of the total number of points as well as the maximization of the reward by trying to find the highest number of the tile. However, deviating from the original experiment, we have changed the design from 2x2 between subject design across different search horizons of short (20 clicks) or long (40 clicks) per trial and across two different environments of Smooth or Rough (distribution of rewards). We have a within subject design and compare the payoff condition (Maximization vs. Accumulation) of subjects with a fixed search horizon (30 clicks).
Our replicated experiment aims to study the differences of the subjects on how they have performed on the experiment to search for rewards spread across the grid. By studying their click patterns, we aim to analyse the differences in the way participants search for rewards and study how the click distance varies in accumulation and maximization payoff condition trials.
In this experiment, we aim to study how humans use generalisation and spatial correlation to guide when and where to explore.
As the experiment is designed to yield similar rewards in a clustered environment spread across the grid, we put forth the research question that humans use this criterion and assume that similar states yield similar outcomes across the two payoff conditions of accumulation of points as well as finding out the maximum reward on the grid.
We study this by analysing the mean distance of clicks, which we assume is higher in the Maximization condition than in the Accumulation condition.
This is an open-platform experiment and the participants are recruited online with no special restrictions concerning the participants of the experiment. Each participant is presented with instructions, is clarified about the data privacy and has to provide consent to participation prior to the beginning of the experiment.
The experiment is a within-subject design with a single independent variable: payoff condition. As the dependent variable we measure the mean Manhattan distance/squared Manhattan distance of the clicks. The distance is measured on a continuous scale, while the payoff condition is measured as a binary factor.
Participants are presented with an 11x11 grid world, with 121 options where they can click on the tiles to reveal the corresponding reward. The rewards are spatially correlated so that the higher and lower rewards are clustered. As an additional aid to guide the search the grid is covered with a heat map with darker colours representing higher rewards, which is revealed upon clicking the tiles. The spatial correlation is designed using the kernels from the original paper. For each grid, the maximal reward is randomly drawn in the range from 65 to 85 so that it cannot be easily guessed.
This is a computer-based experiment. Each participant has to perform a simple computer task in which they explore areas on a grid and uncover rewards. These rewards are accumulated points per trial as well as the maximum point (found during each corresponding trial) that the participant unravels with each click, of the 30 possible clicks during each of the eight trials.
Participants are first welcomed and given general instructions about the experiment. The participant is then asked to perform a test trial of five clicks in order to familiarise with the experiment. Once the participant starts the experiment, they are presented with either an accumulation or a maximization trial at random, which are equally spaced out randomly ordered in the eight designed trials, such that there are four trials of each payoff condition. Each grid/ trial starts with a single tile revealed. Using the mouse, participants click on unexplored tiles to reveal a number corresponding to the number of points they gain.
During the accumulation condition, the participant’s goal is to gain as many points as possible by clicking on tiles in the grid, balancing exploration/exploitation. They then unveil points that are associated with the location on the grid on each tile. In this interactive trial, the total accumulated points are shown on the left of the screen, to inform the participant of the same. When the participant exhausts the given 30 clicks per trial, they are presented with a pop-up screen that informs them of the end of the trial and the experiment proceeds to the next trial from the given 8 trials.
During the maximization condition, the participant’s goal is to gain the highest possible point, with a belief that exploration goal is directed towards the global maximum with the 30 permissible clicks during the very trial. With each click, they unveil a tile from the 121 tiles on the 11x11 grid and their maximum point achieved is displayed on the left of the screen. This is also an interactive condition. As the participant unveils higher points, their maximum point achieved during the unveiling of tiles is displayed in real-time. When the participant exhausts the given 30 clicks per trial, they are presented with a pop-up screen that informs them of the end of the trial and the experiment proceeds to the next trial from the given 8 trials.
Upon conclusion of all of eight trials with appearances of the two payoff conditions four times each, the experiment ends with a post-test survey where participants are asked about their search strategy and can comment on the experiment. The data is then recorded on the _babe server and the participant data is saved on a CSV file, which is retrieved for statistical analysis by the experimenters.
The analysis of the participant data commences by defining outliers.
The exclusion criteria are defined based on the time elapsed as well as the mean distance between the clicks made by the participants. For a fair critical trial, we have set the allowance of 3 times the standard deviation of the time spent for the experiment (3xSD(Time spent)) as well as 3 times the standard deviation of the distance between subsequent clicks (3xSD(distance)). Participants who are too quick during trials are excluded from the analysis of the experiment data. The total time elapsed is considered, and subsequent exclusion from the entire data set is taken into account. The mean distance between tiles, from the click, gives us the way the participant has performed grid search. If participants have a very high deviation from the mean distance, they are excluded from the data pool. This means that they have not taken the experiment seriously and have randomly clicked on tiles across the grid, not adhering to the instructions. Similarly, if the participants have very low or near zero mean distance, they are excluded from the analysis of the experiment data as well. A very low or near zero mean distance suggests that the participant has clicked on the same tile and/ or only adjacent tiles repeatedly during the course of the experiment, thereby defying the instructions.
We begin the statistical analysis of the data by importing it from the babe server and computing the Manhattan distance - or the distance between two clicks. Since the horizon of each trial of our experiment is changed to a fixed one with 30 permissible clicks, each click by the participant is considered as a new click. 66 participants took part in the experiment.
The exclusion criteria are then applied to filter outliers that do not conform with our criteria. We then look at the distance with respect to the payoff condition of Accumulation vs Maximization and the number of clicks by the participants. We find the mean distance between clicks in the Accumulation and Maximization conditions. (Accumulation: 3.034245 ,Maximization: 3.533724)
We use this data to compute with the BRMS tool in R to fit Bayesian generalized (non-)linear multivariate multilevel models using Stan. In general, every parameter is summarized using the mean (‘Estimate’) and the standard deviation (‘Est.Error’) of the posterior distribution as well as two-sided 95% credible intervals (‘l-95% CI’ and ‘u-95% CI’) based on quantiles.
We see that the estimate of Maximization in the first model (3.53) is larger than the estimate of Accumulation (3.03) with a credible 95%-CI interval ([3.36, 3.71]). We can also see this in the second model, where the estimate of Maximization (3.53) is also larger than the estimate of Accumulation (3.03) with a credible interval ([3.25, 3.81]). This indicates that the Maximization payoff condition has a higher mean distance between clicks, which also means that the participants search globally across the grid to attain maximum rewards and not cluster their clicks along a particular area during the search.
Since the subjects are asked to complete eight grids in order to complete the experiment we also included a model with a by-participant random intercept and compared the models with the loo function. The loo_compare function shows: (elpd_diff: -336.6 ; se_diff = 25.5), therefore we prefer the second model (data_modeled_2).
In order to further compare within the two payoff conditions, we run the faintr package. Building on our findings from the brms package that the Maximization payoff condition surely has a higher mean distance thereby confirming that participants search globally, we aim to find by how much does this affect the way participants search across the grid. Using the faintr package on our second model (data_modeled_2), we find the outcome of comparing the two groups gives us a mean difference between the two goals or payoff conditions (Mean ‘higher - lower’: 0.4994). In order to confirm the effects, the P value (to check whether the higher condition of maximization is greater than the lower condition of accumulation) is positive (P(‘higher - lower’ > 0): 1). Findings suggest a positive P value greater than zero. This double testing of our research question further works to prove the same.
Using the results obtained via our statistical analysis of the data, we could successfully prove that the participants search more globally when trying to maximize their rewards than when they are trying to increase the already accumulated rewards. This matches with our research question.
In this paper, we took a look at how participants search differently in two payoff conditions. The researchers from the original paper had further parameters that they changed throughout the experiment and also the extra aspect of function learning. In the frame of this study project, this was too large, but it would be quite interesting to recreate these aspects of the original study.
+++++++
GitHub-repository: https://github.com/jl-schroeder/Generalization_XLAB_final_Project
Link to the experiment: https://decision-space-xplab.netlify.com/
+++++++
# --- importing everthing needed ---
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.1.1 v purrr 0.3.2
## v tibble 2.1.1 v dplyr 0.8.0.1
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts --------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(rstan)
## Loading required package: StanHeaders
## rstan (Version 2.18.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For improved execution time, we recommend calling
## Sys.setenv(LOCAL_CPPFLAGS = '-march=native')
## although this causes Stan to throw an error on a few processors.
##
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
##
## extract
# set cores to use to the total number of cores (minimally 4)
options(mc.cores = max(parallel::detectCores(), 4))
# save a compiled version of the Stan model file
rstan_options(auto_write = TRUE)
library(brms)
## Loading required package: Rcpp
## Loading 'brms' package (version 2.8.0). Useful instructions
## can be found by typing help('brms'). A more detailed introduction
## to the package is available through vignette('brms_overview').
##
## Attaching package: 'brms'
## The following object is masked from 'package:rstan':
##
## loo
#install faintr with
devtools::install_github('michael-franke/bayes_mixed_regression_tutorial/faintr', build_vignettes = TRUE)
## Skipping install of 'faintr' from a github remote, the SHA1 (a8f46ace) has not changed since last install.
## Use `force = TRUE` to force installation
library(faintr)
#fixig the seed
set.seed(141)
# --- helper functions ---
#computes the manhattan block distance between to clicks
#due to the fixed horizon every 30 clicks is a new first click, without a distance
distManhattan <- function(x,y){
dist_Vector = double()
dist_X = c(0)
dist_Y = c(0)
x_prev = 0
y_prev = 0
t = 0
for (i in x){
if (isTRUE(t >= 30)){
t = 0
dist_X <- c(dist_X,0)
}
if (isTRUE(t >= 1)){
dist_X <- c(dist_X,abs(x_prev - i))
#print(dist_X)
}
x_prev = i
t = t + 1
}
t = 0
for (i in y){
if (isTRUE(t >= 30)){
t = 0
dist_Y <- c(dist_Y,0)
}
if (isTRUE(t >=1)){
dist_Y <- c(dist_Y,abs(y_prev - i))
#print(dist_Y)
}
y_prev = i
t = t + 1
}
dist_Vector = dist_X + dist_Y
return(dist_Vector)
}
# --- loading data ---
#creating a tibble with the data
data <- read_csv('results_79_Grid_search_Group_10.csv')
## Parsed with column specification:
## cols(
## .default = col_double(),
## comments = col_logical(),
## education = col_character(),
## gender = col_character(),
## goal = col_character(),
## languages = col_character(),
## startDate = col_character(),
## trial_name = col_character()
## )
## See spec(...) for full column specifications.
## Warning: 3840 parsing failures.
## row col expected actual file
## 2161 comments 1/0/T/F/TRUE/FALSE fun experiment! 'results_79_Grid_search_Group_10.csv'
## 2162 comments 1/0/T/F/TRUE/FALSE fun experiment! 'results_79_Grid_search_Group_10.csv'
## 2163 comments 1/0/T/F/TRUE/FALSE fun experiment! 'results_79_Grid_search_Group_10.csv'
## 2164 comments 1/0/T/F/TRUE/FALSE fun experiment! 'results_79_Grid_search_Group_10.csv'
## 2165 comments 1/0/T/F/TRUE/FALSE fun experiment! 'results_79_Grid_search_Group_10.csv'
## .... ........ .................. ............... .....................................
## See problems(...) for more details.
# fixing the data type
data_mutated <- mutate(data, goal = factor(goal))
glimpse(data)
## Observations: 15,840
## Variables: 21
## $ submission_id <dbl> 929, 929, 929, 929, 929, 929, 929, 929, 929, ...
## $ age <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 2...
## $ comments <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ education <chr> "Graduated High School", "Graduated High Scho...
## $ endTime <dbl> 1.563897e+12, 1.563897e+12, 1.563897e+12, 1.5...
## $ experiment_id <dbl> 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 7...
## $ final_points <dbl> 5, 15, 28, 45, 71, 88, 101, 126, 162, 192, 22...
## $ gender <chr> "male", "male", "male", "male", "male", "male...
## $ goal <chr> "Accumulation", "Accumulation", "Accumulation...
## $ horizonSize <dbl> 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 3...
## $ languages <chr> "Bulgarian, German", "Bulgarian, German", "Bu...
## $ number_of_clicks <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
## $ participant_ID <dbl> 540113, 540113, 540113, 540113, 540113, 54011...
## $ startDate <chr> "Tue Jul 23 2019 17:37:56 GMT+0200 (W. Europe...
## $ startTime <dbl> 1.563896e+12, 1.563896e+12, 1.563896e+12, 1.5...
## $ timeSpent <dbl> 9.6526, 9.6526, 9.6526, 9.6526, 9.6526, 9.652...
## $ trial_name <chr> "grid_search", "grid_search", "grid_search", ...
## $ trial_number <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ value <dbl> 5, 10, 13, 17, 26, 17, 13, 25, 36, 30, 33, 31...
## $ x_coordinate <dbl> 5, 6, 7, 8, 9, 8, 8, 9, 10, 10, 10, 1, 0, 0, ...
## $ y_coordinate <dbl> 5, 5, 5, 5, 5, 6, 4, 6, 5, 4, 6, 5, 5, 6, 7, ...
# --- adding distance to the tibble ---
# distance
distance <- distManhattan(data_mutated$x_coordinate,data_mutated$y_coordinate)
# squared distance
distance2 <- distance^2
data_dist <- add_column(data_mutated,distance) %>% add_column(distance2)
data_dist
# --- clearing the data ---
#filter by time
data_dist
data_filtered<- filter(data_dist,timeSpent < 3*sd(timeSpent)+mean(timeSpent))
ggplot(data = data_dist, aes(participant_ID,timeSpent)) + geom_boxplot()
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
data_filtered
ggplot(data = data_filtered, aes(participant_ID,timeSpent)) + geom_boxplot()
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
#filter for too high and too low mean distance
dataGrouped_goal <- group_by(data_filtered,participant_ID) %>% summarise(dist_mean = mean(distance))
dataGrouped_goal_filt <- filter(dataGrouped_goal,dist_mean <= sd(dist_mean)*3+median(dist_mean),dist_mean >=mean(dist_mean)-sd(dist_mean)*3)
dataGrouped_goal_filt <- c(dataGrouped_goal_filt$participant_ID)
data_filtered <- subset(data_filtered,participant_ID %in% dataGrouped_goal_filt)
dataGrouped_goal
data_filtereddataGrouped_goal <- ungroup(dataGrouped_goal)
ggplot(data = dataGrouped_goal, aes(participant_ID,dist_mean)) + geom_boxplot()
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
# --- looking a the distances ---
# distances with respect to the goal condition
dataGrouped_goal <- group_by(data_filtered,goal) %>% summarise(dist_mean <- mean(distance))
dataGrouped_goal
# squared distances with respect to the goal condition
dataGrouped_goal2 <- group_by(data_filtered,goal) %>% summarise(dist_mean <- mean(distance2))
dataGrouped_goal2
# distances with respect to the number of clicks
dataGrouped_clicks <- group_by(data_filtered,number_of_clicks) %>% summarise(dist_mean <- mean(distance))
dataGrouped_clicks
# squared distances with respect to the number of clicks
dataGrouped_clicks2 <- group_by(data_filtered,number_of_clicks)%>% summarise(dist_mean <- mean(distance2))
dataGrouped_clicks2
# squared distances with respect to the number of clicks and the goal condition combined
dataGrouped_combined <- group_by(data_filtered,goal,number_of_clicks) %>% summarise(dist_mean <- mean(distance2))
dataGrouped_combined
data_modeled <- brm(distance ~ goal,data_filtered)
## Compiling the C++ model
## Start sampling
data_modeled
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: distance ~ goal
## Data: data_filtered (Number of observations: 15360)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept 3.03 0.04 2.96 3.10 2982 1.00
## goalMaximization 0.50 0.05 0.40 0.60 3573 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma 3.12 0.02 3.08 3.15 4347 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
# second brm model adding random effects
data_modeled_2 <- brm(distance ~ goal + (1|participant_ID),data_filtered)
## Compiling the C++ model
## Start sampling
data_modeled_2
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: distance ~ goal + (1 | participant_ID)
## Data: data_filtered (Number of observations: 15360)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Group-Level Effects:
## ~participant_ID (Number of levels: 64)
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept) 0.69 0.07 0.57 0.83 780 1.00
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept 3.04 0.10 2.85 3.22 349 1.01
## goalMaximization 0.50 0.05 0.40 0.59 6748 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma 3.04 0.02 3.01 3.08 9246 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Model_compare_1 = loo(data_modeled)
Model_compare_1
##
## Computed from 4000 by 15360 log-likelihood matrix
##
## Estimate SE
## elpd_loo -39251.1 140.2
## p_loo 4.7 0.2
## looic 78502.2 280.3
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
Model_compare_2 = loo(data_modeled_2)
Model_compare_2
##
## Computed from 4000 by 15360 log-likelihood matrix
##
## Estimate SE
## elpd_loo -38915.1 141.7
## p_loo 63.2 1.2
## looic 77830.3 283.5
## ------
## Monte Carlo SE of elpd_loo is 0.1.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
loo_compare(Model_compare_1,Model_compare_2)
## elpd_diff se_diff
## data_modeled_2 0.0 0.0
## data_modeled -335.9 25.5
faintr_1 <- faintr::compare_groups(data_modeled,higher=list(goal="Maximization"),lower=list(goal="Accumulation"))
faintr_1
## Outcome of comparing groups:
## * higher: goal:Maximization
## * lower: goal:Accumulation
## Mean 'higher - lower': 0.4993
## 95% CI: [ 0.4022 ; 0.5956 ]
## P('higher - lower' > 0): 1
faintr_2 <- faintr::compare_groups(data_modeled_2,higher=list(goal="Maximization"),lower=list(goal="Accumulation"))
faintr_2
## Outcome of comparing groups:
## * higher: goal:Maximization
## * lower: goal:Accumulation
## Mean 'higher - lower': 0.4995
## 95% CI: [ 0.4053 ; 0.5956 ]
## P('higher - lower' > 0): 1